Data Exploration Red Wine Quality by Yuchen_Yeh

Dataset citation:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

This report explores a dataset containing 1,599 red wines with 12 variables: 11 chemical properties of the wine plus a quality score.

Univariate Plots Section

Summary of the data set

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

This sample data set is very tidy, and there are no missing values. Some variables have a wide spread: the mean of residual sugar is 2.539 but the maximum is 15.5. The maximum of chlorides is 0.611 (almost 7 times the mean). Total sulfur dioxide ranges from 6 to 289.
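
As a rough check of how far the maxima sit from the means, the sketch below divides the maximum by the mean for the three variables mentioned. The numbers are copied by hand from the summary output above; the data frame name `spread` and its column names are made up for this illustration.

```r
# Mean and maximum values copied from the summary() output above.
# `spread` is a hypothetical helper table, not part of the original analysis.
spread <- data.frame(
  variable = c("residual.sugar", "chlorides", "total.sulfur.dioxide"),
  mean     = c(2.539, 0.08747, 46.47),
  max      = c(15.5, 0.611, 289)
)

# How many times larger than the mean is the maximum?
spread$max_over_mean <- round(spread$max / spread$mean, 2)
print(spread)
```

All three ratios come out between 6 and 7, which suggests a few unusually high values in each variable.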

Quality of red wine

##   x freq
## 1 3   10
## 2 4   53
## 3 5  681
## 4 6  638
## 5 7  199
## 6 8   18

Each expert graded the wine quality between 0 (very bad) and 10 (excellent). For this red wine sample, the lowest grade is 3 and the highest grade is 8. The majority of wines scored 5 or 6, in the middle of the range.
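
The frequency table can be used to check the sample size, the mean grade and the share of wines in the middle grades. The following is a minimal sketch that re-enters the counts by hand rather than reading the data file; the variable names are chosen for this illustration only.

```r
# Grade counts copied from the frequency table above.
grades <- 3:8
counts <- c(10, 53, 681, 638, 199, 18)

n_total    <- sum(counts)                       # total number of wines
share      <- round(100 * counts / n_total, 1)  # percentage per grade
mean_grade <- sum(grades * counts) / n_total    # frequency-weighted mean grade
share_5_6  <- sum(counts[grades %in% c(5, 6)]) / n_total

print(data.frame(grade = grades, count = counts, percent = share))
cat("Total wines:", n_total, "\n")
cat("Mean grade:", round(mean_grade, 3), "\n")
cat("Share graded 5 or 6:", round(100 * share_5_6, 1), "%\n")
```

The weighted mean reproduces the mean quality reported in the summary table, which is a useful sanity check that the counts were read correctly.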

pH level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Most red wine samples are between 3.0 and 3.5 on the pH scale. In general, the pH of most wines is between 3 and 4; in this sample the lowest pH is 2.74 and the highest is 4.01.

Below, I subset the wines by quality to compare the pH distributions. The first summary is for high-quality wines and the second is for low-quality wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.270   3.289   3.380   3.780

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.380   3.384   3.500   3.900

The pH distribution varies with wine quality. Both distributions look approximately normal but have different means and spreads. The mean pH of high-quality wines (3.289) is slightly lower than that of low-quality wines (3.384), and the low-quality wines are spread over a wider range.

Density of red wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The density of most wines is very close to that of water, and the distribution looks approximately normal, with values ranging from 0.9901 to 1.0040.

Alcohol %

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol distribution is skewed right: most values sit at the low end, with a long tail toward higher percentages. The most frequent alcohol level is about 9.5%.

Acidity level: fixed acidity and volatile acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Fixed acidity, which does not evaporate readily, is between 7 and 9 in most wines. Volatile acidity is between 0.3 and 0.7 in most wines. Around 150 wines have a high volatile acidity of above 0.8, which can lead to an unpleasant, vinegar taste.

Citric acid level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Citric acid can add ‘freshness’ and flavour to wines. Levels vary widely across the samples, mostly between 0.05 and 0.5, and few wines have more than 0.5.

Residual sugar level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The distribution of residual sugar in this red wine sample is heavily skewed to the right: most wines have low values, which means they tend to be less sweet, and a few have much higher ones. Looking more closely at residual sugar between 1 and 4, the most frequent values are between 1.8 and 2.3.

Chlorides level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides represent the amount of salt in the wine. In this red wine data set, the most frequent value of chlorides is 0.1. To see the distribution more clearly, I limited the data to between 0 and 0.2, which shows an approximately normal distribution with a mean of around 0.08.

Sulfur dioxide level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The free form of sulfur dioxide prevents microbial growth and the oxidation of wine; most wines have a level between 0 and 20.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   21.00   30.59   39.00  251.50

The total sulfur dioxide level in most wines is between 0 and 100 (first summary above). To estimate the bound form of sulfur dioxide, I calculated the difference between total and free sulfur dioxide and stored it in a new variable called bound sulfur dioxide (second summary above). About three-quarters of the bound sulfur dioxide values are between 0 and 40, since the third quartile is 39.
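
The derived variable can be sketched as follows. To keep the example self-contained it uses only the first ten rows of the two sulfur dioxide columns, copied from the structure output at the top of the report; the data frame name `so2` is made up for this illustration, and the column name `bound.sulfur.dioxide` follows the one that appears in the correlation matrix.

```r
# First ten rows of the two sulfur dioxide columns, copied from the str() output.
# `so2` is a hypothetical stand-in for the full data frame.
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Bound SO2 is the part of the total that is not in the free form.
so2$bound.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

print(so2)
print(summary(so2$bound.sulfur.dioxide))
```

Because the new column is computed directly from total sulfur dioxide, any later correlation between the two is largely a by-product of the calculation.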

Sulphates level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Sulphates are a wine additive that can contribute to sulfur dioxide levels; most wines have sulphate levels between 0.5 and 1.

Univariate Analysis

What is the structure of your dataset?

There are 1,599 red wines in the dataset with 12 variables: 11 chemical properties of the wine (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates and alcohol) plus quality. The only categorical variable is quality, an integer score; all other variables are continuous.

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are pH and quality. I’d like to know whether pH level determines the quality of wine. I suspect that a combination of other variables is also likely to help build a predictive model of wine quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Volatile acidity, citric acid, residual sugar and free sulfur dioxide likely contribute to the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I created a new variable called bound sulfur dioxide by calculating the difference between total sulfur dioxide and free sulfur dioxide.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of residual sugar is heavily skewed to the right, so I subset the data to values between 0 and 4 to see the main part of the distribution better. The chlorides distribution is also right-skewed, and I limited the data to between 0 and 0.2.

Bivariate Plots Section

Correlation matrix

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
## bound.sulfur.dioxide         -0.08             0.10        0.07
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
## bound.sulfur.dioxide           0.17      0.06                0.43
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
## bound.sulfur.dioxide                 0.96    0.10 -0.11      0.03   -0.22
##                      quality bound.sulfur.dioxide
## fixed.acidity           0.12                -0.08
## volatile.acidity       -0.39                 0.10
## citric.acid             0.23                 0.07
## residual.sugar          0.01                 0.17
## chlorides              -0.13                 0.06
## free.sulfur.dioxide    -0.05                 0.43
## total.sulfur.dioxide   -0.19                 0.96
## density                -0.17                 0.10
## pH                     -0.06                -0.11
## sulphates               0.25                 0.03
## alcohol                 0.48                -0.22
## quality                 1.00                -0.21
## bound.sulfur.dioxide   -0.21                 1.00

From the correlation matrix, pH, residual sugar and free sulfur dioxide seem to have almost no correlation with quality (all within ±0.06). Quality has a moderate negative relationship with volatile acidity (-0.39), a moderate positive relationship with alcohol (+0.48) and a weak positive relationship with citric acid (+0.23).

Apart from the strong positive relationship between total sulfur dioxide and bound sulfur dioxide, which follows from how bound sulfur dioxide was calculated, there are three strong relationships in this data set that I want to explore: a strong negative relationship (-0.50) between alcohol and density; a strong negative relationship (-0.55) between volatile acidity and citric acid; and a strong negative relationship (-0.54) between pH and citric acid.

Quality with alcohol, volatile acidity and citric acid

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

## wqr$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wqr$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wqr$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wqr$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wqr$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wqr$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

The trend between quality and alcohol is clear: higher alcohol percentages tend to go with higher quality, although grade 5 wines have a lower median alcohol level than grade 4 wines. The relationship between quality and volatile acidity is negative, which means better quality is observed at lower volatile acidity. The slope is less steep between quality and citric acid, but higher levels of citric acid are associated with better quality.

Density & alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$density and wqr$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

When density increases, the alcohol level decreases.
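
The test statistic and confidence interval in the output above can be reproduced from the reported correlation and degrees of freedom alone. The sketch below uses the standard formulas for a Pearson correlation test (a t statistic, and a confidence interval based on the Fisher z-transformation); the variable names are chosen for this illustration.

```r
r  <- -0.4961798   # sample correlation reported above
df <- 1597         # degrees of freedom reported above (n - 2)
n  <- df + 2       # number of wines

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

cat("t =", round(t_stat, 3), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

With about 1,600 observations the interval is narrow, so even a moderate correlation such as this one is estimated quite precisely.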

Volatile acidity & citric acid

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$citric.acid and wqr$volatile.acidity
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Volatile acidity and citric acid also show a negative relationship. It is interesting to see that both variables are related to quality to some degree: volatile acidity has a negative relationship with quality and citric acid has a positive relationship with quality.

Fixed acidity & pH

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$pH and wqr$fixed.acidity
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

It is not a surprise to see that the strongest relationship is between pH and fixed acidity, as the pH scale measures how acidic a substance is.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Only two variables show a moderate relationship with quality: volatile acidity (negative, -0.39) and alcohol (positive, +0.48). Surprisingly, the pH level has a very weak relationship with quality (-0.06).

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

There are 6 strong relationships in total: a strong negative relationship (-0.50) between alcohol and density; a strong negative relationship (-0.55) between volatile acidity and citric acid; a strong negative relationship (-0.68) between fixed acidity and pH; a strong negative relationship (-0.54) between pH and citric acid; a strong positive relationship (+0.67) between fixed acidity and density; and a strong positive relationship (+0.67) between fixed acidity and citric acid.

What was the strongest relationship you found?

Total sulfur dioxide and bound sulfur dioxide have a very strong relationship (+0.96), which is expected because bound sulfur dioxide was calculated by subtracting free sulfur dioxide from the total. Leaving that derived variable aside, the strongest relationship is between pH and fixed acidity (-0.68), which is again quite reasonable as pH falls when fixed acidity rises.

Multivariate Plots Section

Density & alcohol by quality cut

## [1] "Mean of density by quality cut"
## wqr$quality.cut: (0,4]
## [1] 0.9966887
## -------------------------------------------------------- 
## wqr$quality.cut: (4,6]
## [1] 0.9968673
## -------------------------------------------------------- 
## wqr$quality.cut: (6,10]
## [1] 0.9960303
## [1] "Mean of alcohol % by quality cut"
## wqr$quality.cut: (0,4]
## [1] 10.21587
## -------------------------------------------------------- 
## wqr$quality.cut: (4,6]
## [1] 10.25272
## -------------------------------------------------------- 
## wqr$quality.cut: (6,10]
## [1] 11.51805

High-quality wines (6,10] have a higher percentage of alcohol, with density spread over a wider range between 0.990 and 1.000. Low-quality wines (0,4], on the other hand, have a lower percentage of alcohol, with density in a narrower range between 0.995 and 1.000.

Fixed acidity & citric acid by quality cut
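
The quality bins used in this section can be sketched with `cut()`. The break points `c(0, 4, 6, 10)` are an assumption inferred from the interval labels in the output above, and the grade counts are re-entered from the frequency table in the univariate section rather than read from the data file.

```r
# Grade counts copied from the quality frequency table.
grades <- 3:8
counts <- c(10, 53, 681, 638, 199, 18)

# cut() builds right-closed intervals by default: (0,4], (4,6], (6,10].
# The break points are inferred from the labels in the report output.
quality.cut <- cut(grades, breaks = c(0, 4, 6, 10))

# Number of wines that fall in each quality bin
bin_sizes <- tapply(counts, quality.cut, sum)
print(bin_sizes)
```

The bins are very unequal in size, with the middle bin holding most of the wines, so the means for the low-quality and high-quality groups rest on far fewer samples.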

## [1] "Mean of fixed acidity by quality cut"
## wqr$quality.cut: (0,4]
## [1] 7.871429
## -------------------------------------------------------- 
## wqr$quality.cut: (4,6]
## [1] 8.254284
## -------------------------------------------------------- 
## wqr$quality.cut: (6,10]
## [1] 8.847005
## [1] "Mean of citric acid by quality cut"
## wqr$quality.cut: (0,4]
## [1] 0.1736508
## -------------------------------------------------------- 
## wqr$quality.cut: (4,6]
## [1] 0.2582638
## -------------------------------------------------------- 
## wqr$quality.cut: (6,10]
## [1] 0.3764977

Low-quality wines tend to have lower fixed acidity together with lower citric acid. High-quality wines have a combination of higher citric acid and higher fixed acidity.

Citric acid & pH by quality cut

## [1] "Mean of pH by quality cut"
## wqr$quality.cut: (0,4]
## [1] 3.384127
## -------------------------------------------------------- 
## wqr$quality.cut: (4,6]
## [1] 3.311296
## -------------------------------------------------------- 
## wqr$quality.cut: (6,10]
## [1] 3.288802
## [1] "Mean of citric acid by quality cut"
## wqr$quality.cut: (0,4]
## [1] 0.1736508
## -------------------------------------------------------- 
## wqr$quality.cut: (4,6]
## [1] 0.2582638
## -------------------------------------------------------- 
## wqr$quality.cut: (6,10]
## [1] 0.3764977

High-quality wines have a combination of lower pH and higher citric acid, while low-quality wines have higher pH and lower citric acid.

Free form ratio in total sulfur dioxide and quality

## 
##  Pearson's product-moment correlation
## 
## data:  wqr$ratio and wqr$quality
## t = 7.9077, df = 1597, p-value = 4.854e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1464861 0.2408427
## sample estimates:
##       cor 
## 0.1941134

The correlation between the ratio of the free form of sulfur dioxide and the quality of the red wines is weak (+0.19), although it is statistically significant.
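
The report does not show how `wqr$ratio` was defined. A plausible reading of the section title is free sulfur dioxide divided by total sulfur dioxide, and the sketch below assumes that definition. It uses only the first ten rows copied from the structure output at the top of the report, and the data frame name `so2` is made up for this illustration.

```r
# First ten rows of the two sulfur dioxide columns, copied from the str() output.
so2 <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Assumed definition: share of the total SO2 that is in the free form.
so2$ratio <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide

print(round(so2$ratio, 3))
```

Under this definition the ratio always lies between 0 and 1, since free sulfur dioxide cannot exceed the total.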

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

High-quality wines have a combination of lower pH and higher citric acid or a combination of higher citric acid and higher fixed acidity. Low-quality wines see a mix of higher pH and lower citric acid or a mix of lower fixed acidity and lower citric acid.

Were there any interesting or surprising interactions between features?

High-quality wines have a higher percentage of alcohol, but they are not tied to any particular range of density. It is surprising to note that there is only a weak correlation between the ratio of the free form of sulfur dioxide and the quality of the red wines.


Final Plots and Summary

Plot One

Description One

A total of 1,319 red wine samples (more than 80%) are graded 5 or 6, and no wine samples are graded lower than 3 or higher than 8.

Plot Two

Description Two

A moderate positive relationship (0.48) is observed between alcohol percentage and wine quality, which means wine quality tends to rise as alcohol percentage increases. It is notable that the median alcohol percentage does not increase from quality 4 to quality 5 (10.0% vs 9.7%).

Plot Three

Description Three

There is a clear pattern: low-quality wines (0,4] tend to have lower fixed acidity and lower citric acid, while high-quality wines (6,10] combine high citric acid with high fixed acidity. More specifically, low-quality wines have a mean fixed acidity of 7.87 and a mean citric acid of 0.17, whereas high-quality wines have a mean fixed acidity of 8.85 and a mean citric acid of 0.38.


Reflection

I found I didn’t need to do any data wrangling for this sample data set as it is very tidy. To my surprise, pH level doesn’t have a strong correlation with quality. I also didn’t see any strong correlation between the ratio of the free form of sulfur dioxide and the quality of the red wines.

In my bivariate analysis, I discovered alcohol and volatile acidity have a moderate relationship with quality. My multivariate analysis shows high-quality wines have a combination of lower pH and higher citric acid or a combination of higher citric acid and higher fixed acidity. On the other hand, low-quality wines see a mix of higher pH and lower citric acid or a mix of lower fixed acidity and lower citric acid.

However, I don’t think I can confidently say that a certain combination of variables guarantees a good-quality wine. It seems that the quality grading by experts is not based purely on the 12 variables provided in the data set. It would be interesting to analyse red wine quality by the year of production, the place of production, etc.